Abstract



Random forests are a scheme proposed by Leo Breiman in the 2000's
for building a predictor ensemble with a set of decision trees that grow
in randomly selected subspaces of data. Despite growing interest and
practical use, there has been little exploration of the statistical
properties of random forests, and little is known about the mathematical
forces driving the algorithm. In this paper, we offer an in-depth
analysis of a random forests model suggested by Breiman in [12], which
is very close to the original algorithm. We show in particular that
the procedure is consistent and adapts to sparsity, in the sense that
its rate of convergence depends only on the number of strong features
and not on how many noise variables are present.

1 Introduction

1.1 Random forests 

In a series of papers and technical reports, Breiman [9, 10, 11, 12] demonstrated
that substantial gains in classification and regression accuracy can be
achieved by using ensembles of trees, where each tree in the ensemble is grown 
in accordance with a random parameter. Final predictions are obtained by 
aggregating over the ensemble. As the base constituents of the ensemble are 
tree-structured predictors, and since each of these trees is constructed using 
an injection of randomness, these procedures are called "random forests" . 

Breiman's ideas were decisively influenced by the early work of Amit and Geman
[3] on geometric feature selection, the random subspace method of Ho
[27] and the random split selection approach of Dietterich [21]. As highlighted
by various empirical studies (see [11, 36, 20, 24, 25] for instance), random
forests have emerged as serious competitors to state-of-the-art methods such 
as boosting (Freund [22]) and support vector machines (Shawe-Taylor and
Cristianini [35]). They are fast and easy to implement, produce highly
accurate predictions and can handle a very large number of input variables
without overfitting. In fact, they are considered to be one of the most accu- 
rate general-purpose learning techniques available. The survey by Genuer et 
al. [21] may provide the reader with practical guidelines and a good starting 
point for understanding the method. 

In Breiman's approach, each tree in the collection is formed by first selecting 
at random, at each node, a small group of input coordinates (also called 
features or variables hereafter) to split on and, secondly, by calculating the 
best split based on these features in the training set. The tree is grown 
using CART methodology (Breiman et al. [13]) to maximum size, without
pruning. This subspace randomization scheme is blended with bagging
([9, 15, 16, 4]) to resample, with replacement, the training data set each time a
new individual tree is grown. 

Although the mechanism appears simple, it involves many different driving 
forces which make it difficult to analyse. In fact, its mathematical properties 
remain to date largely unknown and, up to now, most theoretical studies 
have concentrated on isolated parts or stylized versions of the algorithm.
Interesting attempts in this direction are by Lin and Jeon [32], who establish a
connection between random forests and adaptive nearest neighbor methods
(see also [5] for further results); Meinshausen [33], who studies the
consistency of random forests in the context of conditional quantile prediction;
and Devroye et al. [6], who offer consistency theorems for various simplified
versions of random forests and other randomized ensemble predictors.
Nevertheless, the statistical mechanism of "true" random forests is not yet fully
understood and is still under active investigation.

In the present paper, we go one step further into random forests by working 
out and solidifying the properties of a model suggested by Breiman in [12].
Though this model is still simple compared to the "true" algorithm, it is
nevertheless closer to reality than any other scheme we are aware of. The
short draft [12] is essentially based on intuition and mathematical heuristics,
some of which are questionable and make the document difficult to read and
understand. However, the ideas presented by Breiman are worth clarifying 
and developing, and they will serve as a starting point for our study. 

Before we formalize the model, some definitions are in order. Throughout
the document, we suppose that we are given a training sample
$\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ of i.i.d. $[0,1]^d \times \mathbb{R}$-valued random variables ($d \geq 2$)
with the same distribution as an independent generic pair $(X, Y)$ satisfying
$\mathbb{E}Y^2 < \infty$. The space $[0,1]^d$ is equipped with the standard Euclidean
metric. For fixed $x \in [0,1]^d$, our goal is to estimate the regression function
$r(x) = \mathbb{E}[Y \mid X = x]$ using the data $\mathcal{D}_n$. In this respect, we say that a regression
function estimate $r_n$ is consistent if $\mathbb{E}[r_n(X) - r(X)]^2 \to 0$ as $n \to \infty$.
The main message of this paper is that Breiman's procedure is consistent 
and adapts to sparsity, in the sense that its rate of convergence depends only 
on the number of strong features and not on how many noise variables are 
present. 

1.2 The model 

Formally, a random forest is a predictor consisting of a collection of randomized
base regression trees $\{r_n(x, \Theta_m, \mathcal{D}_n), m \geq 1\}$, where $\Theta_1, \Theta_2, \ldots$ are i.i.d.
outputs of a randomizing variable $\Theta$. These random trees are combined to
form the aggregated regression estimate

\[ \bar{r}_n(X, \mathcal{D}_n) = \mathbb{E}_\Theta\left[r_n(X, \Theta, \mathcal{D}_n)\right], \]

where $\mathbb{E}_\Theta$ denotes expectation with respect to the random parameter, conditionally
on $X$ and the data set $\mathcal{D}_n$. In the following, to lighten notation a
little, we will omit the dependency of the estimates on the sample, and write
for example $\bar{r}_n(X)$ instead of $\bar{r}_n(X, \mathcal{D}_n)$. Note that, in practice, the above
expectation is evaluated by Monte Carlo, i.e., by generating $M$ (usually large)
random trees, and taking the average of the individual outcomes (this procedure
is justified by the law of large numbers, see the appendix in Breiman
[11]). The randomizing variable $\Theta$ is used to determine how the successive
cuts are performed when building the individual trees, such as selection of
the coordinate to split and position of the split.

In the model we have in mind, the variable $\Theta$ is assumed to be independent of
$X$ and the training sample $\mathcal{D}_n$. This excludes in particular any bootstrapping
or resampling step in the training set. This also rules out any data-dependent
strategy to build the trees, such as searching for optimal splits by optimizing
some criterion on the actual observations. However, we allow $\Theta$ to be based
on a second sample, independent of, but distributed as, $\mathcal{D}_n$. This important
issue will be thoroughly discussed in Section 3.

With these warnings in mind, we will assume that each individual random 
tree is constructed in the following way. All nodes of the tree are associated 
with rectangular cells such that at each step of the construction of the tree, 
the collection of cells associated with the leaves of the tree (i.e., external 
nodes) forms a partition of $[0,1]^d$. The root of the tree is $[0,1]^d$ itself. The
following procedure is then repeated $\lceil \log_2 k_n \rceil$ times, where $\log_2$ is the base-2
logarithm, $\lceil \cdot \rceil$ the ceiling function and $k_n \geq 2$ a deterministic parameter,
fixed beforehand by the user, and possibly depending on $n$.

1. At each node, a coordinate of $X = (X^{(1)}, \ldots, X^{(d)})$ is selected, with
the $j$-th feature having a probability $p_{nj} \in (0,1)$ of being selected.

2. At each node, once the coordinate is selected, the split is at the mid- 
point of the chosen side. 

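Each randomized tree $r_n(X, \Theta)$ outputs the average over all $Y_i$ for which the
corresponding vectors $X_i$ fall in the same cell of the random partition as
$X$. In other words, letting $A_n(X, \Theta)$ be the rectangular cell of the random
partition containing $X$,

\[ r_n(X, \Theta) = \frac{\sum_{i=1}^n Y_i \mathbf{1}_{[X_i \in A_n(X,\Theta)]}}{\sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]}} \, \mathbf{1}_{\mathcal{E}_n(X,\Theta)}, \]

where the event $\mathcal{E}_n(X, \Theta)$ is defined by

\[ \mathcal{E}_n(X, \Theta) = \left[ \sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]} \neq 0 \right]. \]

(Thus, by convention, the estimate is set to 0 on empty cells.) Taking finally
expectation with respect to the parameter $\Theta$, the random forests regression
estimate takes the form

\[ \bar{r}_n(X) = \mathbb{E}_\Theta\left[r_n(X, \Theta)\right] = \mathbb{E}_\Theta\left[ \frac{\sum_{i=1}^n Y_i \mathbf{1}_{[X_i \in A_n(X,\Theta)]}}{\sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]}} \, \mathbf{1}_{\mathcal{E}_n(X,\Theta)} \right]. \]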

Let us now make some general remarks about this random forests model. 
First of all, we note that, by construction, each individual tree has exactly
$2^{\lceil \log_2 k_n \rceil}$ ($\approx k_n$) terminal nodes, and each leaf has Lebesgue measure $2^{-\lceil \log_2 k_n \rceil}$
($\approx 1/k_n$). Thus, if $X$ has uniform distribution on $[0,1]^d$, there will be on
average about $n/k_n$ observations per terminal node. In particular, the choice
$k_n = n$ induces a very small number of cases in the final leaves, in accordance
with the idea that the single trees should not be pruned. 

Next, we see that, during the construction of the tree, at each node, each
candidate coordinate $X^{(j)}$ may be chosen with probability $p_{nj} \in (0,1)$. This
implies in particular $\sum_{j=1}^d p_{nj} = 1$. Although we do not specify for the
moment the way these probabilities are generated, we stress that they may
be induced by a second sample. This includes the situation where, at each 
node, randomness is introduced by selecting at random (with or without 
replacement) a small group of input features to split on, and choosing to 
cut the cell along the coordinate — inside this group — which most decreases 
some empirical criterion evaluated on the extra sample. This scheme is close 
to what the original random forests algorithm does, the essential difference 
being that the latter algorithm uses the actual data set to calculate the best 
splits. This point will be properly discussed in Section 3. 

Finally, the requirement that the splits are always achieved at the middle
of the cell sides is mainly technical, and it could possibly be replaced by
a more involved random mechanism, based on the second sample, at the
price of a much more complicated analysis.

The document is organized as follows. In Section 2, we prove that the random 
forests regression estimate $\bar{r}_n$ is consistent and discuss its rate of convergence.
As a striking result, we show under a sparsity framework that the rate of 
convergence depends only on the number of active (or strong) variables and 
not on the dimension of the ambient space. This feature is particularly 
desirable in high-dimensional regression, when the number of variables can 
be much larger than the sample size, and may explain why random forests 
are able to handle a very large number of input variables without overfitting. 
Section 3 is devoted to a discussion, and a small simulation study is presented 
in Section 4. For the sake of clarity, proofs are postponed to Section 5. 



2 Asymptotic analysis 

Throughout the document, we denote by $N_n(X, \Theta)$ the number of data points
falling in the same cell as $X$, i.e.,

\[ N_n(X, \Theta) = \sum_{i=1}^n \mathbf{1}_{[X_i \in A_n(X,\Theta)]}. \]

We start the analysis with the following simple theorem, which shows that
the random forests estimate $\bar{r}_n$ is consistent.

Theorem 2.1 Assume that the distribution of $X$ has support on $[0,1]^d$.
Then the random forests estimate $\bar{r}_n$ is consistent whenever $p_{nj} \log k_n \to \infty$
for all $j = 1, \ldots, d$ and $k_n/n \to 0$ as $n \to \infty$.



Theorem 2.1 mainly serves as an illustration of how the consistency problem
of random forests predictors may be attacked. It encompasses, in particular,
the situation where, at each node, the coordinate to split is chosen uniformly
at random over the $d$ candidates. In this "purely random" model, $p_{nj} = 1/d$,
independently of $n$ and $j$, and consistency is ensured as long as $k_n \to \infty$
and $k_n/n \to 0$. This is however a radically simplified version of the random
forests used in practice, which does not explain the good performance of the 
algorithm. To achieve this goal, a more in-depth analysis is needed. 

There is empirical evidence that many signals in high-dimensional spaces 
admit a sparse representation. As an example, wavelet coefficients of im- 
ages often exhibit exponential decay, and a relatively small subset of all 
wavelet coefficients allows for a good approximation of the original image. 
Such signals have few non-zero coefficients and can therefore be described 
as sparse in the signal domain (see for instance [14]). Similarly, recent
advances in high-throughput technologies, such as array comparative genomic
hybridization, indicate that, despite the huge dimensionality of problems,
only a small number of genes may play a role in determining the outcome 
and be required to create good predictors ([38] for instance). Sparse estima- 
tion is playing an increasingly important role in the statistics and machine 
learning communities, and several methods have recently been developed in 
both fields, which rely upon the notion of sparsity (e.g. penalty methods like
the Lasso and Dantzig selector, see [37, 18, 17, 7] and the references therein).

Following this idea, we will assume in our setting that the target regression
function $r(X) = \mathbb{E}[Y \mid X]$, which is initially a function of $X = (X^{(1)}, \ldots, X^{(d)})$,
depends in fact only on a nonempty subset $\mathcal{S}$ (for Strong) of the $d$ features.
In other words, letting $X_{\mathcal{S}} = (X_j : j \in \mathcal{S})$ and $S = \mathrm{Card}\, \mathcal{S}$, we have

\[ r(X) = \mathbb{E}[Y \mid X_{\mathcal{S}}] \]

or equivalently, for any $x \in [0,1]^d$,

\[ r(x) = r^\star(x_{\mathcal{S}}) \quad \mu\text{-a.s.}, \tag{2.1} \]

where $\mu$ is the distribution of $X$ and $r^\star : [0,1]^S \to \mathbb{R}$ is the section of $r$
corresponding to $\mathcal{S}$. To avoid trivialities, we will assume throughout that $\mathcal{S}$
is nonempty, with $S \geq 2$. The variables in the set $\mathcal{W} = \{1, \ldots, d\} - \mathcal{S}$ (for
Weak) have thus no influence on the response and could be safely removed.
In the dimension reduction scenario we have in mind, the ambient dimension
$d$ can be very large, much larger than the sample size $n$, but we believe that
the representation is sparse, i.e., that very few coordinates of $r$ are non-zero,
with indices corresponding to the set $\mathcal{S}$. Note however that representation
(2.1) does not forbid the somewhat undesirable case where $S = d$. As such, the
value $S$ characterizes the sparsity of the model: the smaller $S$, the sparser
$r$.

Within this sparsity framework, it is intuitively clear that the coordinate-sampling
probabilities should ideally satisfy the constraints $p_{nj} = 1/S$ for
$j \in \mathcal{S}$ (and, consequently, $p_{nj} = 0$ otherwise). However, this is too strong a
requirement, which has no chance of being satisfied in practice, except maybe
in some special situations where we know beforehand which variables are
important and which are not. Thus, to stick to reality, we will rather require
in the following that $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$ (and $p_{nj} = \xi_{nj}$ otherwise),
where $p_{nj} \in (0,1)$ and each $\xi_{nj}$ tends to 0 as $n$ tends to infinity. We will
see in Section 3 how to design a randomization mechanism to obtain such
probabilities, on the basis of a second sample independent of the training set
$\mathcal{D}_n$. At this point, it is important to note that the dimensions $d$ and $S$ are
held constant throughout the document. In particular, these dimensions are
not functions of the sample size $n$, as may be the case in other asymptotic
studies.
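As a concrete illustration of this constraint, the sketch below builds one admissible probability vector. The construction is an assumption made for the example only: every weak coordinate receives the same small mass `eps` and the remainder is shared equally among the strong ones, so that $\xi_{nj} = \texttt{eps}$ for weak $j$ and $\xi_{nj} = -(d - S)\,\texttt{eps}$ for strong $j$. The names `sampling_probs`, `strong` and `eps` are hypothetical.

```python
import numpy as np

def sampling_probs(d, strong, eps):
    """p_nj = (1/S)(1 + xi_nj) on the strong set, p_nj = xi_nj = eps elsewhere."""
    strong = np.asarray(strong)
    S = len(strong)
    if eps <= 0 or eps * (d - S) >= 1:
        raise ValueError("eps must be positive and leave mass for the strong coordinates")
    probs = np.full(d, float(eps))
    # Strong coordinates share what is left, so the vector sums to one.
    probs[strong] = (1.0 - eps * (d - S)) / S
    return probs
```

Letting `eps` shrink with the sample size mimics sequences $\xi_{nj}$ that tend to 0.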

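We now have enough material for a deeper understanding of the random
forests algorithm. To lighten notation a little, we will write

\[ W_{ni}(X, \Theta) = \frac{\mathbf{1}_{[X_i \in A_n(X,\Theta)]}}{N_n(X, \Theta)} \, \mathbf{1}_{\mathcal{E}_n(X,\Theta)}, \]

so that the estimate takes the form

\[ \bar{r}_n(X) = \sum_{i=1}^n \mathbb{E}_\Theta\left[W_{ni}(X, \Theta)\right] Y_i. \]

Let us start with the variance/bias decomposition

\[ \mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2 = \mathbb{E}\left[\bar{r}_n(X) - \tilde{r}_n(X)\right]^2 + \mathbb{E}\left[\tilde{r}_n(X) - r(X)\right]^2, \tag{2.2} \]

where we set

\[ \tilde{r}_n(X) = \sum_{i=1}^n \mathbb{E}_\Theta\left[W_{ni}(X, \Theta)\right] r(X_i). \]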
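The decomposition (2.2) holds because the cross term vanishes. A short justification, sketched here under the model's assumption that the randomizing variable is independent of the sample, and writing $\bar{r}_n$ for the forest estimate and $\tilde{r}_n$ for the auxiliary quantity just defined, runs as follows:

\[ \bar{r}_n(X) - \tilde{r}_n(X) = \sum_{i=1}^n \mathbb{E}_\Theta\left[W_{ni}(X, \Theta)\right] \left(Y_i - r(X_i)\right), \qquad \mathbb{E}\left[Y_i - r(X_i) \mid X, X_1, \ldots, X_n\right] = 0. \]

Since the weights $\mathbb{E}_\Theta[W_{ni}(X, \Theta)]$ and the difference $\tilde{r}_n(X) - r(X)$ are functions of $X, X_1, \ldots, X_n$ only, conditioning on these variables gives

\[ \mathbb{E}\left[\left(\bar{r}_n(X) - \tilde{r}_n(X)\right)\left(\tilde{r}_n(X) - r(X)\right)\right] = 0. \]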



The two terms of (2.2) will be examined separately, in Proposition 2.1 and
Proposition 2.2, respectively. Throughout, the symbol $\mathbb{V}$ denotes variance.



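Proposition 2.1 Assume that $X$ is uniformly distributed on $[0,1]^d$ and, for
all $x \in \mathbb{R}^d$,

\[ \sigma^2(x) = \mathbb{V}[Y \mid X = x] \leq \sigma^2 \]

for some positive constant $\sigma^2$. Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$,

\[ \mathbb{E}\left[\bar{r}_n(X) - \tilde{r}_n(X)\right]^2 \leq C \sigma^2 \left(\frac{S^2}{S-1}\right)^{S/2d} (1 + \xi_n) \, \frac{k_n}{n (\log k_n)^{S/2d}}, \]

where

\[ C = \frac{288}{\pi} \left(\frac{\pi \log 2}{16}\right)^{S/2d}. \]

The sequence $(\xi_n)$ depends on the sequences $\{(\xi_{nj}) : j \in \mathcal{S}\}$ only and tends
to 0 as $n$ tends to infinity.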



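Remark 1 A close inspection of the end of the proof of Proposition 2.1
reveals that

\[ 1 + \xi_n = \prod_{j \in \mathcal{S}} \left[ (1 + \xi_{nj})^{-1} \left(1 - \frac{\xi_{nj}}{S-1}\right)^{-1} \right]^{1/2d}. \]

In particular, if $a < p_{nj} < b$ for some constants $a, b \in (0,1)$, then

\[ 1 + \xi_n \leq \left( \frac{S-1}{S^2 a (1-b)} \right)^{S/2d}. \]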

The main message of Proposition 2.1 is that the variance of the forests
estimate is $O(k_n/(n(\log k_n)^{S/2d}))$. This result is interesting by itself since it
shows the effect of aggregation on the variance of the forest. To understand
this remark, recall that individual (random or not) trees are proved to be
consistent by letting the number of cases in each terminal node become large
(see [19, Chapter 20]), with a typical variance of the order $k_n/n$. Thus, for
such trees, the choice $k_n = n$ (i.e., about one observation on average in each
terminal node) is clearly not suitable and leads to serious overfitting and
variance explosion. On the other hand, the variance of the forest is of the
order $k_n/(n(\log k_n)^{S/2d})$. Therefore, letting $k_n = n$, the variance is of the
order $1/(\log n)^{S/2d}$, a quantity which still goes to 0 as $n$ grows! The proof of
Proposition 2.1 reveals that this log term is a by-product of the $\Theta$-averaging
process, which appears by taking into consideration the correlation between
trees. We believe that it provides an interesting perspective on why random
forests are still able to do a good job, despite the fact that individual trees
are not pruned.
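To get a feel for how slowly this logarithmic factor acts, the right-hand side of the bound in Proposition 2.1 can be evaluated numerically. The sketch below is only an illustration: the function name `variance_bound` and its default arguments (`sigma2=1.0`, `xi_n=0.0`) are assumptions, and the returned value is an upper bound, not the variance itself.

```python
import math

def variance_bound(n, k_n, S, d, sigma2=1.0, xi_n=0.0):
    """Right-hand side of the variance bound of Proposition 2.1."""
    a = S / (2 * d)
    C = (288 / math.pi) * (math.pi * math.log(2) / 16) ** a
    return C * sigma2 * (S ** 2 / (S - 1)) ** a * (1 + xi_n) * k_n / (n * math.log(k_n) ** a)

# With k_n = n the bound decays like 1 / (log n)^(S/2d): very slowly, but to 0.
for n in (10 ** 3, 10 ** 6, 10 ** 9):
    print(n, variance_bound(n, n, S=5, d=10))
```

With $S = 5$ and $d = 10$ the exponent $S/2d$ equals $1/4$, so squaring the sample size only multiplies the bound by $2^{-1/4}$.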

Note finally that the requirement that X is uniformly distributed on the 
hypercube could be safely replaced by the assumption that X has a density 
with respect to the Lebesgue measure on $[0,1]^d$ and the density is bounded
from above and from below. The case where the density of X is not bounded 
from below necessitates a specific analysis, which we believe is beyond the 
scope of the present paper. We refer the reader to [5] for results in this 
direction (see also Remark 5 in Section 5). 



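Let us now turn to the analysis of the bias term in equality (2.2). Recall that
$r^\star$ denotes the section of $r$ corresponding to $\mathcal{S}$.

Proposition 2.2 Assume that $X$ is uniformly distributed on $[0,1]^d$ and $r^\star$
is $L$-Lipschitz on $[0,1]^S$. Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$,

\[ \mathbb{E}\left[\tilde{r}_n(X) - r(X)\right]^2 \leq \frac{2 S L^2}{k_n^{\frac{0.75}{S \log 2}(1 + \gamma_n)}} + \left[ \sup_{x \in [0,1]^d} r^2(x) \right] e^{-n/2k_n}, \]

where $\gamma_n = \min_{j \in \mathcal{S}} \xi_{nj}$ tends to 0 as $n$ tends to infinity.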

This result essentially shows that the rate at which the bias decreases to 0
depends on the number of strong variables, not on $d$. In particular, the quantity
$k_n^{-(0.75/(S \log 2))(1 + \gamma_n)}$ should be compared with the ordinary partitioning
estimate bias, which is of the order $k_n^{-2/d}$ under the smoothness conditions
of Proposition 2.2 (see for instance [26]). In this respect, it is easy to see
that $k_n^{-(0.75/(S \log 2))(1 + \gamma_n)} = o(k_n^{-2/d})$ as soon as $S \leq \lfloor 0.54 d \rfloor$ ($\lfloor \cdot \rfloor$ is the
integer part function). In other words, when the number of active variables is
less than (roughly) half of the ambient dimension, the bias of the random
forests regression estimate decreases to 0 much faster than the usual rate.
The restriction $S \leq \lfloor 0.54 d \rfloor$ is not severe, since in all practical situations we
have in mind, $d$ is usually very large with respect to $S$ (this is, for instance,
typically the case in modern genome biology problems, where $d$ may be of the
order of a billion, and in any case much larger than the actual number of active
features). Note at last that, contrary to Proposition 2.1, the term $e^{-n/2k_n}$
prevents the extreme choice $k_n = n$ (about one observation on average in
each terminal node). Indeed, an inspection of the proof of Proposition 2.2
reveals that this term accounts for the probability that $N_n(X, \Theta)$ is precisely
0, i.e., $A_n(X, \Theta)$ is empty.
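The threshold $0.54\,d$ can be recovered by comparing the two exponents directly. Ignoring the factor $1 + \gamma_n$, which tends to 1, the bias exponent of the forest beats the exponent $2/d$ of the ordinary partitioning estimate exactly when

\[ \frac{0.75}{S \log 2} > \frac{2}{d} \iff S < \frac{0.375\, d}{\log 2} \approx 0.541\, d. \]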

Recalling the elementary inequality $z e^{-nz} \leq e^{-1}/n$ for $z \in [0,1]$, we may
finally join Proposition 2.1 and Proposition 2.2 and state our main theorem.



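Theorem 2.2 Assume that $X$ is uniformly distributed on $[0,1]^d$, $r^\star$ is
$L$-Lipschitz on $[0,1]^S$ and, for all $x \in \mathbb{R}^d$,

\[ \sigma^2(x) = \mathbb{V}[Y \mid X = x] \leq \sigma^2 \]

for some positive constant $\sigma^2$. Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$, letting
$\gamma_n = \min_{j \in \mathcal{S}} \xi_{nj}$, we have

\[ \mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2 \leq \Xi_n \frac{k_n}{n} + \frac{2 S L^2}{k_n^{\frac{0.75}{S \log 2}(1 + \gamma_n)}}, \]

where

\[ \Xi_n = C \sigma^2 \left(\frac{S^2}{S-1}\right)^{S/2d} (1 + \xi_n) + 2 e^{-1} \left[ \sup_{x \in [0,1]^d} r^2(x) \right] \]

and

\[ C = \frac{288}{\pi} \left(\frac{\pi \log 2}{16}\right)^{S/2d}. \]

The sequence $(\xi_n)$ depends on the sequences $\{(\xi_{nj}) : j \in \mathcal{S}\}$ only and tends
to 0 as $n$ tends to infinity.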
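For the record, the elementary inequality used to absorb the exponential term of the bias bound into the $k_n/n$ term can be checked in one line. The map $z \mapsto z e^{-nz}$ has derivative $(1 - nz) e^{-nz}$, so it is maximal at $z = 1/n$:

\[ z e^{-nz} \leq \frac{e^{-1}}{n} \quad \text{for } z \in [0,1], \qquad \text{and, with } z = \frac{1}{2k_n}, \qquad e^{-n/2k_n} \leq 2 e^{-1} \frac{k_n}{n}. \]

This is where the constant $2e^{-1}$ multiplying the supremum of $r^2$ comes from.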

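As we will see in Section 3, it may be safely assumed that the randomization
process allows for $\xi_{nj} \log n \to 0$ as $n \to \infty$, for all $j \in \mathcal{S}$. Thus, under this
condition, Theorem 2.2 shows that with the optimal choice

\[ k_n \propto n^{1/\left(1 + \frac{0.75}{S \log 2}\right)}, \]

we get

\[ \mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2 = O\left( n^{\frac{-0.75}{S \log 2 + 0.75}} \right). \]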
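The optimal choice comes from balancing the two terms of the bound in Theorem 2.2. Writing $\alpha = 0.75/(S \log 2)$ as a shorthand (a notation introduced here for convenience) and ignoring constants,

\[ \frac{k_n}{n} = k_n^{-\alpha} \iff k_n = n^{1/(1+\alpha)}, \qquad \text{and then} \qquad \frac{k_n}{n} = n^{-\alpha/(1+\alpha)} = n^{\frac{-0.75}{S \log 2 + 0.75}}. \]

The factor $1 + \gamma_n$ in the exponent is harmless under the stated condition, since $k_n^{-\alpha \gamma_n} = \exp(-\alpha \gamma_n \log k_n)$ tends to 1 when $\gamma_n \log n \to 0$.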

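This result can be made more precise. Denote by $\mathcal{F}_S$ the class of $(L, \sigma^2)$-smooth
distributions $(X, Y)$ such that $X$ has uniform distribution on $[0,1]^d$,
the regression function $r^\star$ is Lipschitz with constant $L$ on $[0,1]^S$ and, for all
$x \in \mathbb{R}^d$, $\sigma^2(x) = \mathbb{V}[Y \mid X = x] \leq \sigma^2$.

Corollary 2.1 Let

\[ \Xi = C \sigma^2 \left(\frac{S^2}{S-1}\right)^{S/2d} + 2 e^{-1} \left[ \sup_{x \in [0,1]^d} r^2(x) \right] \]

and

\[ C = \frac{288}{\pi} \left(\frac{\pi \log 2}{16}\right)^{S/2d}. \]

Then, if $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$, with $\xi_{nj} \log n \to 0$ as $n \to \infty$, for
the choice

\[ k_n \propto \left(\frac{L^2}{\Xi}\right)^{1/\left(1 + \frac{0.75}{S \log 2}\right)} n^{1/\left(1 + \frac{0.75}{S \log 2}\right)}, \]

we have

\[ \limsup_{n \to \infty} \sup_{(X,Y) \in \mathcal{F}_S} \frac{\mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2}{\left( \Xi L^{\frac{2 S \log 2}{0.75}} \right)^{\frac{0.75}{S \log 2 + 0.75}} n^{\frac{-0.75}{S \log 2 + 0.75}}} \leq \Lambda, \]

where $\Lambda$ is a positive constant independent of $r$, $L$ and $\sigma^2$.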

This result reveals the fact that the $L^2$-rate of convergence of $\bar{r}_n(X)$ to $r(X)$
depends only on the number $S$ of strong variables, and not on the ambient
dimension $d$. The main message of Corollary 2.1 is that if we are able to
properly tune the probability sequences $(p_{nj})_{n \geq 1}$ and make them sufficiently
fast to track the informative features, then the rate of convergence of the
random forests estimate will be of the order $n^{-0.75/(S \log 2 + 0.75)}$. This rate is strictly
faster than the usual rate $n^{-2/(d+2)}$ as soon as $S \leq \lfloor 0.54 d \rfloor$. To understand
this point, just recall that the rate $n^{-2/(d+2)}$ is minimax optimal for the
class $\mathcal{F}_d$ (see for example Ibragimov and Khasminskii [28, 29, 30]), seen as
a collection of regression functions over $[0,1]^d$, not $[0,1]^S$. However, in our
setting, the intrinsic dimension of the regression problem is $S$, not $d$, and the
random forests estimate cleverly adapts to the sparsity of the problem. As an
illustration, Figure 1 shows the plot of the function $S \mapsto 0.75/(S \log 2 + 0.75)$
for $S$ ranging from 2 to $d = 100$.

It is noteworthy that the rate of convergence of the $\xi_{nj}$ to 0 (and, consequently,
the rate at which the probabilities $p_{nj}$ approach $1/S$ for $j \in \mathcal{S}$) will
eventually depend on the ambient dimension $d$ through the ratio $S/d$. The
same is true for the Lipschitz constant $L$ and the factor $\sup_{x \in [0,1]^d} r^2(x)$ which
both appear in Corollary 2.1. To figure out this remark, remember first that
the support of $r$ is contained in $\mathbb{R}^S$, so that the latter supremum (respectively,
the Lipschitz constant) is in fact a supremum (respectively, a Lipschitz constant)
over $\mathbb{R}^S$, not over $\mathbb{R}^d$. Next, denote by $\mathcal{C}_p(s)$ the collection of functions
$\eta : [0,1]^p \to [0,1]$ for which each derivative of order $s$ satisfies a Lipschitz
condition. It is well known that the $\varepsilon$-entropy $\log_2(\mathcal{N}_\varepsilon)$ of $\mathcal{C}_p(s)$ is $\Phi(\varepsilon^{-p/(s+1)})$
as $\varepsilon \downarrow 0$ (Kolmogorov and Tihomirov [31]), where $a_n = \Phi(b_n)$ means that
$a_n = O(b_n)$ and $b_n = O(a_n)$. Here we have an interesting interpretation of
the dimension reduction phenomenon: working with Lipschitz functions on
$\mathbb{R}^S$ (that is, $s = 0$) is roughly equivalent to working with functions on $\mathbb{R}^d$ for
which all $[(d/S) - 1]$-th order derivatives are Lipschitz! For example, if $S = 1$
and $d = 25$, $(d/S) - 1 = 24$ and, as there are $25^{24}$ such partial derivatives
in $\mathbb{R}^{25}$, we note immediately the potential benefit of recovering the "true"
dimension $S$.

Figure 1: Solid line: Plot of the function $S \mapsto 0.75/(S \log 2 + 0.75)$ for
$S$ ranging from 2 to $d = 100$. Dotted line: Plot of the minimax rate power
$S \mapsto 2/(S + 2)$. The horizontal line shows the value of the $d$-dimensional rate
power $2/(d + 2) \approx 0.0196$.
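The rate powers compared in this discussion (and plotted in Figure 1) are easy to recompute. The helper names below (`forest_exponent`, `minimax_exponent`, `largest_favourable_S`) are illustrative, not from the paper; the sketch simply evaluates the two formulas and searches for the largest number of strong variables at which the forest rate still beats the $d$-dimensional rate.

```python
import math

def forest_exponent(S):
    """Rate power 0.75 / (S log 2 + 0.75) of the random forests estimate."""
    return 0.75 / (S * math.log(2) + 0.75)

def minimax_exponent(p):
    """Rate power 2 / (p + 2) for Lipschitz regression in dimension p."""
    return 2.0 / (p + 2)

def largest_favourable_S(d):
    """Largest S whose forest rate power exceeds the d-dimensional one (0 if none)."""
    return max((S for S in range(1, d + 1)
                if forest_exponent(S) > minimax_exponent(d)), default=0)

print(largest_favourable_S(100))   # agrees with floor(0.54 * 100) = 54
print(minimax_exponent(100))       # about 0.0196, the horizontal line of Figure 1
```

For every $S \geq 2$ the forest power stays below the $S$-dimensional minimax power $2/(S+2)$, so the solid curve lies under the dotted one.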

Remark 2 The reduced-dimensional rate $n^{-0.75/(S \log 2 + 0.75)}$ is strictly larger than
the $S$-dimensional optimal rate $n^{-2/(S+2)}$, which is also shown in Figure 1 for
$S$ ranging from 2 to 100. We do not know whether the latter rate can be
achieved by the algorithm. ■

Remark 3 The optimal parameter $k_n$ of Corollary 2.1 depends on the unknown
distribution of $(X, Y)$, especially on the smoothness of the regression
function and the effective dimension $S$. To correct this situation, adaptive
(i.e., data-dependent) choices of $k_n$, such as data-splitting or cross-validation,
should preserve the rate of convergence of the estimate. Another route we
may follow is to analyse the effect of bootstrapping the sample before growing
the individual trees (i.e., bagging). It is our belief that this procedure should
also preserve the rate of convergence, even for overfitted trees ($k_n \approx n$), in
the spirit of [4]. However, such a study is beyond the scope of the present
paper. ■

Remark 4 For further references, it is interesting to note that Proposition
2.1 (variance term) is a consequence of aggregation, whereas Proposition 2.2
(bias term) is a consequence of randomization.

It is also stimulating to keep in mind the following analysis, which has been
suggested to us by a referee. Suppose, to simplify, that $Y = r(X)$ (no-noise
regression) and that $\sum_{i=1}^n W_{ni}(X, \Theta) = 1$ a.s. In this case, the variance term
is 0 and we have

\[ \bar{r}_n(X) = \tilde{r}_n(X) = \sum_{i=1}^n \mathbb{E}_\Theta\left[W_{ni}(X, \Theta)\right] Y_i. \]

Set $Z_n = (Y, Y_1, \ldots, Y_n)$. Then

\[ \begin{aligned} \mathbb{E}\left[\bar{r}_n(X) - r(X)\right]^2 &= \mathbb{E}\left[\bar{r}_n(X) - Y\right]^2 \\ &= \mathbb{E}\left[ \mathbb{E}\left[ (\bar{r}_n(X) - Y)^2 \mid Z_n \right] \right] \\ &= \mathbb{E}\left[ \mathbb{E}\left[ (\bar{r}_n(X) - \mathbb{E}[\bar{r}_n(X) \mid Z_n])^2 \mid Z_n \right] \right] + \mathbb{E}\left[ \mathbb{E}[\bar{r}_n(X) \mid Z_n] - Y \right]^2. \end{aligned} \]

The conditional expectation in the first of the two terms above may be rewritten
under the form

\[ \mathbb{E}\left[ \mathrm{Cov}\left( \mathbb{E}_\Theta[r_n(X, \Theta)], \mathbb{E}_{\Theta'}[r_n(X, \Theta')] \mid Z_n \right) \right], \]

where $\Theta'$ is distributed as, and independent of, $\Theta$. A closer look shows that this
last term is indeed equal to

\[ \mathbb{E}\left[ \mathbb{E}_{\Theta, \Theta'} \mathrm{Cov}\left( r_n(X, \Theta), r_n(X, \Theta') \mid Z_n \right) \right]. \]

The key observation is that if trees have strong predictive power, then they
can be unconditionally strongly correlated while being conditionally weakly
correlated. This opens an interesting line of research for the statistical analysis
of the bias term, in connection with Amit [2] and Blanchard [8] conditional
covariance-analysis ideas. ■

3 Discussion

The results which have been obtained in Section 2 rely on appropriate behavior
of the probability sequences $(p_{nj})_{n \geq 1}$, $j = 1, \ldots, d$. We recall that these
sequences should be in $(0,1)$ and obey the constraints $p_{nj} = (1/S)(1 + \xi_{nj})$
for $j \in \mathcal{S}$ (and $p_{nj} = \xi_{nj}$ otherwise), where the $(\xi_{nj})_{n \geq 1}$ tend to 0 as $n$
tends to infinity. In other words, at each step of the construction of the
individual trees, the random procedure should track and preferentially cut 
the strong coordinates. In this more informal section, we briefly discuss a 
random mechanism for inducing such probability sequences. 

Suppose, to start with an imaginary scenario, that we already know which 
coordinates are strong, and which are not. In this ideal case, the random 
selection procedure described in the introduction may be easily made more 
precise as follows. A positive integer $M_n$, possibly depending on $n$, is fixed
beforehand and the following splitting scheme is iteratively repeated at each
node of the tree:

1. Select at random, with replacement, $M_n$ candidate coordinates to split
on.

2. If the selection is all weak, then choose one at random to split on. If
one or more strong variables are elected, choose one of them at random
and cut.

Within this framework, it is easy to see that each coordinate in $\mathcal{S}$ will be cut
with the "ideal" probability

\[ p_n^\star = \frac{1}{S} \left[ 1 - \left(1 - \frac{S}{d}\right)^{M_n} \right]. \]

Though this is an idealized model, it already gives some information about
the choice of the parameter $M_n$, which, in accordance with the results of
Section 2 (Corollary 2.1), should satisfy

\[ \left(1 - \frac{S}{d}\right)^{M_n} \log n \to 0 \quad \text{as } n \to \infty. \]

This is true as soon as

\[ M_n \to \infty \quad \text{and} \quad \frac{M_n}{\log n} \to \infty \quad \text{as } n \to \infty. \]

This result is consistent with the general empirical finding that $M_n$ (called
mtry in the R package randomForest) does not need to be very large (see,
for example, Breiman [11]), but not with the widespread belief that $M_n$
should not depend on $n$. Note also that if the $M_n$ features are chosen at
random without replacement, then things are even simpler since, in this
case, $p_n^\star = 1/S$ for all $n$ large enough.

In practice, we have only a vague idea about the size and content of the set
$\mathcal{S}$. However, to circumvent this problem, we may use the observations of an
independent second set $\mathcal{D}'_n$ (say, of the same size as $\mathcal{D}_n$) in order to mimic
the ideal split probability $p_n^\star$. To illustrate this mechanism, suppose, to keep
things simple, that the model is linear, i.e.,

\[ Y = \sum_{j \in \mathcal{S}} a_j X^{(j)} + \varepsilon, \]

where $X = (X^{(1)}, \ldots, X^{(d)})$ is uniformly distributed over $[0,1]^d$, the $a_j$ are
non-zero real numbers, and $\varepsilon$ is a zero-mean random noise, which is assumed
to be independent of $X$ and with finite variance. Note that, in accordance
with our sparsity assumption, $r(X) = \sum_{j \in \mathcal{S}} a_j X^{(j)}$ depends on $X_{\mathcal{S}}$ only.

Assume now that we have done some splitting and arrived at a current set
of terminal nodes. Consider any of these nodes, say $A = \prod_{j=1}^d A_j$, fix a
coordinate $j \in \{1, \ldots, d\}$, and look at the weighted conditional variance
$\mathbb{V}[Y \mid X^{(j)} \in A_j] \, \mathbb{P}(X^{(j)} \in A_j)$. It is a simple exercise to prove that if $X$ is
uniform and $j \in \mathcal{S}$, then the split on the $j$-th side which most decreases the
weighted conditional variance is at the midpoint of the node, with a variance
decrease equal to $a_j^2/16 > 0$. On the other hand, if $j \in \mathcal{W}$, the decrease of
the variance is always 0, whatever the location of the split.
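The "simple exercise" can be sketched for the root cell, taking $A_j = [0,1]$ as a simplifying assumption. Write $Y = a_j X^{(j)} + Z$, where $Z$ collects the other terms and the noise and is independent of $X^{(j)}$. A uniform variable on an interval of length $\ell$ has variance $\ell^2/12$, so cutting the $j$-th side at $t \in (0,1)$ changes the weighted conditional variance from $a_j^2/12 + \mathbb{V}[Z]$ to

\[ t \left( \frac{a_j^2 t^2}{12} + \mathbb{V}[Z] \right) + (1 - t) \left( \frac{a_j^2 (1-t)^2}{12} + \mathbb{V}[Z] \right), \]

a decrease of

\[ \frac{a_j^2}{12} \left( 1 - t^3 - (1-t)^3 \right) = \frac{a_j^2}{4} \, t (1 - t), \]

which is maximal at $t = 1/2$, where it equals $a_j^2/16$. For a weak coordinate the coefficient is 0 and no cut changes the weighted variance.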

On the practical side, the conditional variances are of course unknown, but
they may be estimated by replacing the theoretical quantities by their
respective sample estimates (as in the CART procedure, see Breiman et al. [13,
Chapter 8] for a thorough discussion) evaluated on the second sample $\mathcal{D}'_n$.
This suggests the following procedure, at each node of the tree:

1. Select at random, with replacement, $M_n$ candidate coordinates to split
on.

2. For each of the $M_n$ elected coordinates, calculate the best split, i.e.,
the split which most decreases the within-node sum of squares on the
second sample $\mathcal{D}'_n$.

3. Select one variable at random among the coordinates which output the
best within-node sum of squares decreases, and cut.
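The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `X2` and `Y2` stand for the points of the second sample that fall in the current node, `m_try` plays the role of $M_n$, and the function names are hypothetical. The best split of a coordinate is searched over all positions between consecutive ordered observations.

```python
import numpy as np

def best_sse_decrease(x, y):
    """Largest decrease of the within-node sum of squares over all splits on x."""
    n = len(y)
    if n < 2:
        return 0.0
    y = y[np.argsort(x)]
    total = np.sum((y - y.mean()) ** 2)
    csum, csq = np.cumsum(y), np.cumsum(y ** 2)
    best = 0.0
    for m in range(1, n):  # the left child holds the m smallest x values
        left = csq[m - 1] - csum[m - 1] ** 2 / m
        right = (csq[-1] - csq[m - 1]) - (csum[-1] - csum[m - 1]) ** 2 / (n - m)
        best = max(best, total - left - right)
    return best

def choose_coordinate(X2, Y2, m_try, rng):
    """Steps 1 to 3: elect m_try coordinates, score them, cut along a best one."""
    elected = rng.integers(0, X2.shape[1], size=m_try)   # step 1, with replacement
    gains = np.array([best_sse_decrease(X2[:, j], Y2) for j in elected])  # step 2
    winners = elected[np.isclose(gains, gains.max())]    # step 3, ties broken at random
    return int(rng.choice(winners))
```

The training sample never enters these functions, which is the point of the construction: the cut coordinates depend on the second sample only.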
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This procedure is indeed close to what the random forests algorithm does. 
The essential difference is that we suppose to have at hand a second sample 
P^, whereas the original algorithm performs the search of the optimal cuts 
on the original observations Vn- This point is important, since the use of 
an extra sample preserves the independence of G (the random mechanism) 
and Vn (the training sample). We do not know whether our results are 
still true if 9 depends on V„ (as in the CART algorithm), but the analysis 
does not appear to be simple. Note also that, at step [SJ a threshold (or 
a test procedure, as suggested in Amaratunga et al. pj) could be used to 
choose among the most significant variables, whereas the actual algorithm 
just selects the best one. In fact, depending on the context and the actual 
cut selection procedure, the informative probabilities Pnj (j G S) may obey 
the constraints Pnj — )■ Pj as ra — > oo (thus, pj is not necessarily equal to 1/5*), 
where the pj are positive and satisfy J2jesPj ~ ^- This should not affect the 
results of the article. 
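The three-step node procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the second sample is passed as plain Python lists `X2` (rows) and `Y2`, cuts are placed halfway between consecutive distinct values, and the function names are hypothetical.

```python
import random


def best_split_for_coordinate(xs, ys):
    """Return (decrease, threshold) for the cut that most reduces the sum of squares."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    parent_ss = total_sq - total * total / n
    best = (0.0, None)
    left_sum = left_sq = 0.0
    for i in range(n - 1):
        x, y = pairs[i]
        left_sum += y
        left_sq += y * y
        if pairs[i + 1][0] == x:          # cannot cut between tied values
            continue
        n_left, n_right = i + 1, n - i - 1
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        child_ss = ((left_sq - left_sum ** 2 / n_left)
                    + (right_sq - right_sum ** 2 / n_right))
        decrease = parent_ss - child_ss
        if decrease > best[0]:
            best = (decrease, (x + pairs[i + 1][0]) / 2.0)
    return best


def choose_split(X2, Y2, m_try, rng):
    """Steps 1-3 evaluated on the second sample; returns (coordinate, threshold)."""
    d = len(X2[0])
    # Step 1: draw m_try candidate coordinates with replacement.
    candidates = [rng.randrange(d) for _ in range(m_try)]
    # Step 2: best cut for each distinct candidate.
    scored = {j: best_split_for_coordinate([x[j] for x in X2], Y2)
              for j in set(candidates)}
    # Step 3: pick at random among the coordinates achieving the best decrease.
    top = max(dec for dec, _ in scored.values())
    winners = [j for j, (dec, _) in scored.items() if dec == top]
    j = rng.choice(winners)
    return j, scored[j][1]
```

Passing a seeded `random.Random` instance as `rng` keeps the randomization independent of the data, which mirrors the role the second sample plays in the argument above.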

This empirical randomization scheme leads to complicated probabilities of cuts
which, this time, vary at each node of each tree and are not easily amenable
to analysis. Nevertheless, observing that the average number of cases per
terminal node is about $n/k_n$, it may be inferred by the law of large numbers
that each variable in $\mathcal{S}$ will be cut with probability

$$p_{nj} \approx \frac{1}{S} \left[ 1 - \left( 1 - \frac{S}{d} \right)^{M_n} \right] (1 + \zeta_{nj}),$$

where $\zeta_{nj}$ is of the order $O(k_n/n)$, a quantity which anyway goes fast to 0 as
$n$ tends to infinity. Put differently, for $j \in \mathcal{S}$,

$$p_{nj} \approx \frac{1}{S} (1 + \xi_{nj}),$$

where $\xi_{nj}$ goes to 0 and satisfies the constraint $\xi_{nj} \log n \to 0$ as $n$ tends to
infinity, provided $k_n \log n / n \to 0$, $M_n \to \infty$ and $M_n / \log n \to \infty$. This is
coherent with the requirements of Corollary 2.1. We realize however that
this is a rough approach, and that more theoretical work is needed here to
fully understand the mechanisms involved in CART and Breiman's original
randomization process.
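To get a feel for the leading term of this heuristic, the sketch below evaluates $\frac{1}{S}[1 - (1 - S/d)^{M}]$, ignoring the correction factor $(1 + \zeta_{nj})$. The bracket is the probability that at least one of the $S$ strong coordinates appears among $M$ draws made with replacement from $d$ coordinates. The function name is illustrative only.

```python
def informative_cut_probability(S, d, M):
    """Leading term (1/S) * (1 - (1 - S/d)**M) of the heuristic cut probability."""
    return (1.0 - (1.0 - S / d) ** M) / S
```

With a single candidate ($M = 1$) the value is $1/d$, as for a purely random choice, and it approaches $1/S$ as $M$ grows.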

It is also noteworthy that random forests use the so-called out-of-bag samples
(i.e., the bootstrapped data which are not used to fit the trees) to construct
a variable importance criterion, which measures the prediction strength of
each feature (see, e.g., Genuer et al. [25]). As far as we are aware, there is to
date no systematic mathematical study of this criterion. It is our belief that
such a study would greatly benefit from the sparsity point of view developed
in the present paper, but is unfortunately much beyond its scope. Lastly, it
would also be interesting to work out and extend our results to the context
of unsupervised learning of trees. A good route to follow in this respect is
given by the strategies outlined in Amit and Geman [3, Section 5.5].

4 A small simulation study 

Even though the first vocation of the present paper is theoretical, we offer in 
this short section some experimental results on synthetic data. Our aim is 
not to provide a thorough practical study of the random forests method, but 
rather to illustrate the main ideas of the article. From now on, we let $\mathcal{U}([0, 1]^d)$
(respectively, $\mathcal{N}(0, 1)$) be the uniform distribution over $[0, 1]^d$ (respectively,
the standard Gaussian distribution). Specifically, three models were tested:

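1. [Sinus] For $\mathbf{x} \in [0, 1]^d$, the regression function takes the form

$$r(\mathbf{x}) = 10 \sin(10 \pi x^{(1)}).$$

We let $Y = r(\mathbf{X}) + \varepsilon$ and $\mathbf{X} \sim \mathcal{U}([0, 1]^d)$ ($d \geq 1$), with $\varepsilon \sim \mathcal{N}(0, 1)$.

2. [Friedman #1] This is a model proposed in Friedman [23]. Here,

$$r(\mathbf{x}) = 10 \sin(\pi x^{(1)} x^{(2)}) + 20 (x^{(3)} - .05)^2 + 10 x^{(4)} + 5 x^{(5)}$$

and $Y = r(\mathbf{X}) + \varepsilon$, where $\mathbf{X} \sim \mathcal{U}([0, 1]^d)$ ($d \geq 5$) and $\varepsilon \sim \mathcal{N}(0, 1)$.

3. [Tree] In this example, we let $Y = r(\mathbf{X}) + \varepsilon$, where $\mathbf{X} \sim \mathcal{U}([0, 1]^d)$
($d \geq 5$), $\varepsilon \sim \mathcal{N}(0, 1)$ and the function $r$ has itself a tree structure.
This tree-type function, which is shown in Figure 2, involves only five
variables.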

We note that, although the ambient dimension d may be large, the effective 
dimension of model 1 is $S = 1$, whereas model 2 and model 3 have $S = 5$.
In other words, $\mathcal{S} = \{1\}$ for model 1, whereas $\mathcal{S} = \{1, \dots, 5\}$ for model 2
and model 3. Observe also that, in our context, the model Tree should be
considered as a "no-bias" model, on which the random forests algorithm is 
expected to perform well. 
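The first two models are simple to simulate. The sketch below is an illustration using only the Python standard library (the experiments themselves were run in R); the function names are hypothetical, and the constant .05 in the Friedman #1 function is taken as written above. The Tree model is left out because its regression function is only given by Figure 2.

```python
import math
import random


def r_sinus(x):
    """Regression function of model Sinus; only the first coordinate matters."""
    return 10.0 * math.sin(10.0 * math.pi * x[0])


def r_friedman1(x):
    """Regression function of model Friedman #1; only the first five coordinates matter."""
    return (10.0 * math.sin(math.pi * x[0] * x[1])
            + 20.0 * (x[2] - 0.05) ** 2
            + 10.0 * x[3]
            + 5.0 * x[4])


def sample(r, n, d, rng):
    """Draw n points uniformly on [0, 1]^d and add standard Gaussian noise to r."""
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    Y = [r(x) + rng.gauss(0.0, 1.0) for x in X]
    return X, Y
```

For instance, `sample(r_friedman1, 500, 100, random.Random(0))` gives a learning set with ambient dimension 100 and five informative coordinates.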

In a first series of experiments, we let $d = 100$ and, for each of the three
models and different values of the sample size $n$, we generated a learning set
of size $n$ and fitted a forest (10 000 trees) with mtry = $d$. For $j = 1, \dots, d$, the
ratio (number of times the $j$-th coordinate is split)/(total number of splits
over the forest) was evaluated, and the whole experiment was repeated 100
times. Figure 3, Figure 4 and Figure 5 report the resulting boxplots for each
of the first twenty variables and different values of $n$. These figures clearly
highlight the fact that, as $n$ grows, the probability of cuts does concentrate
on the informative variables only and support the assumption that $\xi_{nj} \to 0$
as $n \to \infty$ for each $j \in \mathcal{S}$.

Figure 2: The tree used as regression function in the model Tree.

Next, in a second series of experiments, for each model, for different values of
$d$ and for sample sizes $n$ ranging from 10 to 1000, we generated a learning set
of size $n$, a test set of size 50 000 and evaluated the mean squared error (MSE)
of the random forests (RF) method via the Monte Carlo approximation

$$\mathrm{MSE} \approx \frac{1}{50\,000} \sum_{j=1}^{50\,000} \left[ \mathrm{RF}(\text{test data } \#j) - r(\text{test data } \#j) \right]^2.$$
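Note that this error compares the forest with the true regression function $r$ at the test points, not with noisy responses. A minimal sketch, assuming the fitted forest is available as a callable `predict` (a hypothetical name) and the test set as a list of points:

```python
def monte_carlo_mse(predict, r, test_points):
    """Average of [predict(x) - r(x)]^2 over the test points."""
    return sum((predict(x) - r(x)) ** 2 for x in test_points) / len(test_points)
```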



All results were averaged over 100 data sets. The random forests algorithm
was performed with the parameter mtry automatically tuned by the R package
RandomForests, 1000 random trees and the minimum node size set to
5 (which is the default value for regression). Besides, in order to compare
the "true" algorithm with the approximate model discussed in the present
document, an alternative method was also tested. This auxiliary algorithm
has characteristics which are identical to the original ones (same mtry, same
number of random trees), with the notable difference that now the maximum
number of nodes is fixed beforehand. For the sake of coherence, since the
minimum node size is set to 5 in the RandomForests package, the number of
terminal nodes in the custom algorithm was calibrated to $\lceil n/5 \rceil$. It must be
stressed that the essential difference between the standard random forests
algorithm and the alternative one is that the number of cases in the final leaves
is fixed in the former, whereas the latter assumes a fixed number of terminal
nodes. In particular, in both algorithms, cuts are performed using the
actual sample, just as CART does. To keep things simple, no data-splitting
procedure has been incorporated in the modified version.

Figure 3: Boxplots of the empirical probabilities of cuts for model Sinus ($\mathcal{S} = \{1\}$).

Figure 6, Figure 7 and Figure 8 illustrate the evolution of the MSE value
with respect to $n$ and $d$, for each model and the two tested procedures.

Figure 4: Boxplots of the empirical probabilities of cuts for model Friedman #1
($\mathcal{S} = \{1, \dots, 5\}$).

First, we note that the overall performance of the alternative method is very
similar to that of the original algorithm. This confirms our idea that
the model discussed in the present paper is a good approximation of
Breiman's original forests. Next, we see that for a sufficiently large $n$, the
capabilities of the forests are nearly independent of $d$, in accordance with the
idea that the (asymptotic) rate of convergence of the method should only
depend on the "true" dimensionality $S$ (Theorem 2.2). Finally, as expected,
it is noteworthy that both algorithms perform well on the third model, which
has been precisely designed for a tree-structured predictor.



Figure 5: Boxplots of the empirical probabilities of cuts for model Tree ($\mathcal{S} = \{1, \dots, 5\}$).



5 Proofs 

Throughout this section, we will make repeated use of the following two facts. 

Fact 5.1 Let $K_{nj}(\mathbf{X}, \Theta)$ be the number of times the terminal node $A_n(\mathbf{X}, \Theta)$
is split on the $j$-th coordinate ($j = 1, \dots, d$). Then, conditionally on $\mathbf{X}$,
$K_{nj}(\mathbf{X}, \Theta)$ has binomial distribution with parameters $\lceil \log_2 k_n \rceil$ and $p_{nj}$ (by
independence of $\mathbf{X}$ and $\Theta$). Moreover, by construction,

$$\sum_{j=1}^{d} K_{nj}(\mathbf{X}, \Theta) = \lceil \log_2 k_n \rceil.$$

Figure 6: Evolution of the MSE for model Sinus ($S = 1$).

Recall that we denote by $N_n(\mathbf{X}, \Theta)$ the number of data points falling in the
same cell as $\mathbf{X}$, i.e.,

$$N_n(\mathbf{X}, \Theta) = \sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}.$$

Let $\lambda$ be the Lebesgue measure on $[0, 1]^d$.

Fact 5.2 By construction,

$$\lambda(A_n(\mathbf{X}, \Theta)) = 2^{-\lceil \log_2 k_n \rceil}.$$

In particular, if $\mathbf{X}$ is uniformly distributed on $[0, 1]^d$, then the distribution
of $N_n(\mathbf{X}, \Theta)$ conditionally on $\mathbf{X}$ and $\Theta$ is binomial with parameters $n$ and
$2^{-\lceil \log_2 k_n \rceil}$ (by independence of the random variables $\mathbf{X}, \mathbf{X}_1, \dots, \mathbf{X}_n, \Theta$).
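Both facts can be illustrated by simulating the cell that contains a query point. The sketch below assumes the construction analysed in the paper: $\lceil \log_2 k_n \rceil$ successive cuts, each at the midpoint of the current cell along a coordinate drawn with probabilities $p_{nj}$. The function name and argument layout are illustrative only.

```python
import math
import random


def cell_of(x, k_n, probs, rng):
    """Build the cell containing x with ceil(log2 k_n) random midpoint cuts."""
    d = len(x)
    depth = math.ceil(math.log2(k_n))
    lower, upper = [0.0] * d, [1.0] * d
    counts = [0] * d                      # counts[j] plays the role of K_nj
    for _ in range(depth):
        j = rng.choices(range(d), weights=probs)[0]   # coordinate to cut
        mid = (lower[j] + upper[j]) / 2.0
        if x[j] < mid:
            upper[j] = mid                # keep the half containing x
        else:
            lower[j] = mid
        counts[j] += 1
    return lower, upper, counts
```

Whatever the random draws, the counts add up to $\lceil \log_2 k_n \rceil$, side $j$ has length $2^{-K_{nj}}$, and the volume of the cell is $2^{-\lceil \log_2 k_n \rceil}$.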

Remark 5 If $\mathbf{X}$ is not uniformly distributed but has a probability density
$f$ on $[0, 1]^d$, then, conditionally on $\mathbf{X}$ and $\Theta$, $N_n(\mathbf{X}, \Theta)$ is binomial with
parameters $n$ and $\mathbb{P}(\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \mid \mathbf{X}, \Theta)$. If $f$ is bounded from above and
from below, this probability is of the order $\lambda(A_n(\mathbf{X}, \Theta)) = 2^{-\lceil \log_2 k_n \rceil}$, and
the whole approach can be carried out without difficulty. On the other hand,
for more general densities, the binomial probability depends on $\mathbf{X}$, and this
makes the analysis significantly harder. ■

Figure 7: Evolution of the MSE for model Friedman #1 ($S = 5$).



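5.1 Proof of Theorem 2.1

Observe first that, by Jensen's inequality,

$$\begin{aligned}
\mathbb{E}[\bar r_n(\mathbf{X}) - r(\mathbf{X})]^2 &= \mathbb{E}\big[\mathbb{E}_\Theta[r_n(\mathbf{X}, \Theta) - r(\mathbf{X})]\big]^2 \\
&\leq \mathbb{E}[r_n(\mathbf{X}, \Theta) - r(\mathbf{X})]^2.
\end{aligned}$$

A slight adaptation of Theorem 4.2 in Györfi et al. [26] shows that $\bar r_n$ is
consistent if both $\operatorname{diam}(A_n(\mathbf{X}, \Theta)) \to 0$ in probability and $N_n(\mathbf{X}, \Theta) \to \infty$
in probability.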

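Let us first prove that $N_n(\mathbf{X}, \Theta) \to \infty$ in probability. To see this, consider
the random tree partition defined by $\Theta$, which has by construction exactly
$2^{\lceil \log_2 k_n \rceil}$ rectangular cells, say $A_1, \dots, A_{2^{\lceil \log_2 k_n \rceil}}$. Let $N_1, \dots, N_{2^{\lceil \log_2 k_n \rceil}}$ denote
the number of observations among $\mathbf{X}, \mathbf{X}_1, \dots, \mathbf{X}_n$ falling in these $2^{\lceil \log_2 k_n \rceil}$
cells, and let $\mathcal{C} = \{\mathbf{X}, \mathbf{X}_1, \dots, \mathbf{X}_n\}$ denote the set of positions of these $n + 1$
points. Since these points are independent and identically distributed, fixing
the set $\mathcal{C}$ and $\Theta$, the conditional probability that $\mathbf{X}$ falls in the $\ell$-th cell equals
$N_\ell/(n + 1)$. Thus, for every fixed $M > 0$,

$$\begin{aligned}
\mathbb{P}(N_n(\mathbf{X}, \Theta) < M) &= \mathbb{E}\big[\mathbb{P}(N_n(\mathbf{X}, \Theta) < M \mid \mathcal{C}, \Theta)\big] \\
&= \mathbb{E}\Bigg[\sum_{\ell = 1, \dots, 2^{\lceil \log_2 k_n \rceil} :\, N_\ell < M} \frac{N_\ell}{n + 1}\Bigg] \\
&\leq \frac{M\, 2^{\lceil \log_2 k_n \rceil}}{n + 1} \\
&\leq \frac{2 M k_n}{n + 1},
\end{aligned}$$

which converges to 0 by our assumption on $k_n$.

Figure 8: Evolution of the MSE for model Tree ($S = 5$).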

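It remains to show that $\operatorname{diam}(A_n(\mathbf{X}, \Theta)) \to 0$ in probability. To this aim,
let $V_{nj}(\mathbf{X}, \Theta)$ be the size of the $j$-th dimension of the rectangle containing
$\mathbf{X}$. Clearly, it suffices to show that $V_{nj}(\mathbf{X}, \Theta) \to 0$ in probability for all
$j = 1, \dots, d$. To this end, note that

$$V_{nj}(\mathbf{X}, \Theta) = 2^{-K_{nj}(\mathbf{X}, \Theta)},$$

where, conditionally on $\mathbf{X}$, $K_{nj}(\mathbf{X}, \Theta)$ has a binomial $\mathcal{B}(\lceil \log_2 k_n \rceil, p_{nj})$
distribution, representing the number of times the box containing $\mathbf{X}$ is split along
the $j$-th coordinate (Fact 5.1). Thus

$$\begin{aligned}
\mathbb{E}[V_{nj}(\mathbf{X}, \Theta)] &= \mathbb{E}\big[2^{-K_{nj}(\mathbf{X}, \Theta)}\big] \\
&= \mathbb{E}\big[\mathbb{E}\big[2^{-K_{nj}(\mathbf{X}, \Theta)} \mid \mathbf{X}\big]\big] \\
&= (1 - p_{nj}/2)^{\lceil \log_2 k_n \rceil},
\end{aligned}$$

which tends to 0 as $p_{nj} \log k_n \to \infty$.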
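The closed form for the expected side length is the probability generating function of a binomial variable evaluated at $1/2$: if $K \sim \mathcal{B}(N, p)$, then $\mathbb{E}[2^{-K}] = (1 - p + p/2)^N = (1 - p/2)^N$. The short check below sums the series directly; the function names are illustrative.

```python
import math


def binomial_pmf(N, p, k):
    """P(K = k) for K ~ Binomial(N, p)."""
    return math.comb(N, k) * p ** k * (1.0 - p) ** (N - k)


def expected_side_length(N, p):
    """E[2^{-K}] for K ~ Binomial(N, p), summed term by term."""
    return sum(2.0 ** (-k) * binomial_pmf(N, p, k) for k in range(N + 1))
```

Here $N$ stands for $\lceil \log_2 k_n \rceil$ and $p$ for $p_{nj}$; the value decays to 0 exactly when $N p \to \infty$.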



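5.2 Proof of Proposition 2.1

Recall that

$$\bar r_n(\mathbf{X}) = \sum_{i=1}^{n} \mathbb{E}_\Theta[W_{ni}(\mathbf{X}, \Theta)]\, Y_i,$$

where

$$W_{ni}(\mathbf{X}, \Theta) = \frac{\mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}}{N_n(\mathbf{X}, \Theta)}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta)}$$

and

$$\mathcal{E}_n(\mathbf{X}, \Theta) = [N_n(\mathbf{X}, \Theta) \neq 0].$$

Similarly,

$$\tilde r_n(\mathbf{X}) = \sum_{i=1}^{n} \mathbb{E}_\Theta[W_{ni}(\mathbf{X}, \Theta)]\, r(\mathbf{X}_i).$$

We have

$$\begin{aligned}
\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2
&= \mathbb{E}\Bigg[\sum_{i=1}^{n} \mathbb{E}_\Theta[W_{ni}(\mathbf{X}, \Theta)]\,(Y_i - r(\mathbf{X}_i))\Bigg]^2 \\
&= \mathbb{E}\Bigg[\sum_{i=1}^{n} \mathbb{E}_\Theta^2[W_{ni}(\mathbf{X}, \Theta)]\,(Y_i - r(\mathbf{X}_i))^2\Bigg] \\
&\qquad \text{(the cross terms are 0 since } \mathbb{E}[Y_i \mid \mathbf{X}_i] = r(\mathbf{X}_i)\text{)} \\
&= \mathbb{E}\Bigg[\sum_{i=1}^{n} \mathbb{E}_\Theta^2[W_{ni}(\mathbf{X}, \Theta)]\, \sigma^2(\mathbf{X}_i)\Bigg] \\
&\leq \sigma^2\, \mathbb{E}\Bigg[\sum_{i=1}^{n} \mathbb{E}_\Theta^2[W_{ni}(\mathbf{X}, \Theta)]\Bigg] \\
&= n \sigma^2\, \mathbb{E}\big[\mathbb{E}_\Theta^2[W_{n1}(\mathbf{X}, \Theta)]\big],
\end{aligned}$$

where we used a symmetry argument in the last equality. Observe now that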

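$$\begin{aligned}
\mathbb{E}_\Theta^2[W_{n1}(\mathbf{X}, \Theta)]
&= \mathbb{E}_\Theta[W_{n1}(\mathbf{X}, \Theta)]\, \mathbb{E}_{\Theta'}[W_{n1}(\mathbf{X}, \Theta')] \\
&\qquad \text{(where } \Theta' \text{ is distributed as, and independent of, } \Theta\text{)} \\
&= \mathbb{E}_{\Theta, \Theta'}[W_{n1}(\mathbf{X}, \Theta)\, W_{n1}(\mathbf{X}, \Theta')] \\
&= \mathbb{E}_{\Theta, \Theta'}\Bigg[\frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}\, \mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta')]}}{N_n(\mathbf{X}, \Theta)\, N_n(\mathbf{X}, \Theta')}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta)}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta')}\Bigg] \\
&= \mathbb{E}_{\Theta, \Theta'}\Bigg[\frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}}{N_n(\mathbf{X}, \Theta)\, N_n(\mathbf{X}, \Theta')}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta)}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta')}\Bigg].
\end{aligned}$$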



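Consequently,

$$\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2 \leq n \sigma^2\, \mathbb{E}\Bigg[\frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}}{N_n(\mathbf{X}, \Theta)\, N_n(\mathbf{X}, \Theta')}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta)}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta')}\Bigg].$$

Therefore

$$\begin{aligned}
&\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2 \\
&\quad \leq n \sigma^2\, \mathbb{E}\Bigg[\frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}}{\big(1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}\big)\big(1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta')]}\big)}\Bigg] \\
&\quad = n \sigma^2\, \mathbb{E}\Bigg[\mathbb{E}\Bigg[\frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}} \times \frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta')]}} \,\Bigg|\, \mathbf{X}, \mathbf{X}_1, \Theta, \Theta'\Bigg]\Bigg] \\
&\quad = n \sigma^2\, \mathbb{E}\Bigg[\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}\, \mathbb{E}\Bigg[\frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}} \times \frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta')]}} \,\Bigg|\, \mathbf{X}, \mathbf{X}_1, \Theta, \Theta'\Bigg]\Bigg] \\
&\quad = n \sigma^2\, \mathbb{E}\Bigg[\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}\, \mathbb{E}\Bigg[\frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}} \times \frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta')]}} \,\Bigg|\, \mathbf{X}, \Theta, \Theta'\Bigg]\Bigg]
\end{aligned}$$

by the independence of the random variables $\mathbf{X}, \mathbf{X}_1, \dots, \mathbf{X}_n, \Theta, \Theta'$. Using the
Cauchy-Schwarz inequality, the above conditional expectation can be upper
bounded by

$$\begin{aligned}
&\mathbb{E}^{1/2}\Bigg[\frac{1}{\big(1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}\big)^2} \,\Bigg|\, \mathbf{X}, \Theta\Bigg] \times \mathbb{E}^{1/2}\Bigg[\frac{1}{\big(1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta')]}\big)^2} \,\Bigg|\, \mathbf{X}, \Theta'\Bigg] \\
&\quad \leq \frac{3 \times 2^{2 \lceil \log_2 k_n \rceil}}{n^2} \qquad \text{(by Fact 5.2 and technical Lemma 5.1)} \\
&\quad \leq \frac{12 k_n^2}{n^2}.
\end{aligned}$$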

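It follows that

$$\begin{aligned}
\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2
&\leq \frac{12 \sigma^2 k_n^2}{n}\, \mathbb{E}\big[\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}\big] \\
&= \frac{12 \sigma^2 k_n^2}{n}\, \mathbb{E}\big[\mathbb{E}_{\mathbf{X}_1}\big[\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')]}\big]\big] \\
&= \frac{12 \sigma^2 k_n^2}{n}\, \mathbb{E}\big[\mathbb{P}_{\mathbf{X}_1}(\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta'))\big].
\end{aligned} \tag{5.1}$$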

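Next, using the fact that $\mathbf{X}_1$ is uniformly distributed over $[0, 1]^d$, we may
write

$$\begin{aligned}
\mathbb{P}_{\mathbf{X}_1}(\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')) &= \lambda(A_n(\mathbf{X}, \Theta) \cap A_n(\mathbf{X}, \Theta')) \\
&= \prod_{j=1}^{d} \lambda(A_{nj}(\mathbf{X}, \Theta) \cap A_{nj}(\mathbf{X}, \Theta')),
\end{aligned}$$

where

$$A_n(\mathbf{X}, \Theta) = \prod_{j=1}^{d} A_{nj}(\mathbf{X}, \Theta) \quad \text{and} \quad A_n(\mathbf{X}, \Theta') = \prod_{j=1}^{d} A_{nj}(\mathbf{X}, \Theta').$$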



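On the other hand, we know (Fact 5.1) that, for all $j = 1, \dots, d$,

$$\lambda(A_{nj}(\mathbf{X}, \Theta)) = 2^{-K_{nj}(\mathbf{X}, \Theta)},$$

where, conditionally on $\mathbf{X}$, $K_{nj}(\mathbf{X}, \Theta)$ has a binomial $\mathcal{B}(\lceil \log_2 k_n \rceil, p_{nj})$
distribution and, similarly,

$$\lambda(A_{nj}(\mathbf{X}, \Theta')) = 2^{-K'_{nj}(\mathbf{X}, \Theta')},$$

where, conditionally on $\mathbf{X}$, $K'_{nj}(\mathbf{X}, \Theta')$ is binomial $\mathcal{B}(\lceil \log_2 k_n \rceil, p_{nj})$ and
independent of $K_{nj}(\mathbf{X}, \Theta)$. In the rest of the proof, to lighten notation, we write
$K_{nj}$ and $K'_{nj}$ instead of $K_{nj}(\mathbf{X}, \Theta)$ and $K'_{nj}(\mathbf{X}, \Theta')$, respectively. Clearly,

$$\lambda(A_{nj}(\mathbf{X}, \Theta) \cap A_{nj}(\mathbf{X}, \Theta')) \leq 2^{-\max(K_{nj}, K'_{nj})}$$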

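and, consequently,

$$\prod_{j=1}^{d} \lambda(A_{nj}(\mathbf{X}, \Theta) \cap A_{nj}(\mathbf{X}, \Theta')) \leq 2^{-\lceil \log_2 k_n \rceil} \prod_{j=1}^{d} 2^{-(K_{nj} - K'_{nj})_+}$$

(since, by Fact 5.1, $\sum_{j=1}^{d} K_{nj} = \lceil \log_2 k_n \rceil$). Plugging this inequality into
(5.1) and applying Hölder's inequality, we obtain

$$\begin{aligned}
\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2
&\leq \frac{12 \sigma^2 k_n}{n}\, \mathbb{E}\Bigg[\prod_{j=1}^{d} 2^{-(K_{nj} - K'_{nj})_+}\Bigg] \\
&= \frac{12 \sigma^2 k_n}{n}\, \mathbb{E}\Bigg[\mathbb{E}\Bigg[\prod_{j=1}^{d} 2^{-(K_{nj} - K'_{nj})_+} \,\Bigg|\, \mathbf{X}\Bigg]\Bigg] \\
&\leq \frac{12 \sigma^2 k_n}{n}\, \mathbb{E}\Bigg[\prod_{j=1}^{d} \mathbb{E}^{1/d}\Big[2^{-d (K_{nj} - K'_{nj})_+} \,\Big|\, \mathbf{X}\Big]\Bigg].
\end{aligned}$$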

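Each term in the product may be bounded by technical Proposition 5.1, and
this leads to

$$\begin{aligned}
\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2
&\leq \frac{288 \sigma^2 k_n}{\pi n} \prod_{j=1}^{d} \min\Bigg(1, \Bigg[\frac{\pi}{16 \lceil \log_2 k_n \rceil\, p_{nj} (1 - p_{nj})}\Bigg]^{1/2d}\Bigg) \\
&\leq \frac{288 \sigma^2 k_n}{\pi n} \prod_{j=1}^{d} \min\Bigg(1, \Bigg[\frac{\pi \log 2}{16 (\log k_n)\, p_{nj} (1 - p_{nj})}\Bigg]^{1/2d}\Bigg).
\end{aligned}$$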

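Using the assumption on the form of the $p_{nj}$, we finally conclude that

$$\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2 \leq C \sigma^2 \left(\frac{S^2}{S - 1}\right)^{S/2d} (1 + \xi_n)\, \frac{k_n}{n (\log k_n)^{S/2d}},$$

where

$$C = \frac{288}{\pi} \left(\frac{\pi \log 2}{16}\right)^{S/2d}$$

and

$$1 + \xi_n = \prod_{j \in \mathcal{S}} \left[(1 + \xi_{nj})^{-1} \left(1 - \frac{\xi_{nj}}{S - 1}\right)^{-1}\right]^{1/2d}.$$

Clearly, the sequence $(\xi_n)$, which depends on the $\{(\xi_{nj}) : j \in \mathcal{S}\}$ only, tends
to 0 as $n$ tends to infinity.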



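5.3 Proof of Proposition 2.2

We start with the decomposition

$$\begin{aligned}
&\mathbb{E}[\tilde r_n(\mathbf{X}) - r(\mathbf{X})]^2 \\
&\quad = \mathbb{E}\Bigg[\sum_{i=1}^{n} \mathbb{E}_\Theta[W_{ni}(\mathbf{X}, \Theta)]\,(r(\mathbf{X}_i) - r(\mathbf{X})) + \Bigg(\sum_{i=1}^{n} \mathbb{E}_\Theta[W_{ni}(\mathbf{X}, \Theta)] - 1\Bigg) r(\mathbf{X})\Bigg]^2 \\
&\quad = \mathbb{E}\Bigg[\mathbb{E}_\Theta\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X})) + \Bigg(\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta) - 1\Bigg) r(\mathbf{X})\Bigg]\Bigg]^2 \\
&\quad \leq \mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X})) + \Bigg(\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta) - 1\Bigg) r(\mathbf{X})\Bigg]^2,
\end{aligned}$$

where, in the last step, we used Jensen's inequality. Consequently,

$$\begin{aligned}
\mathbb{E}[\tilde r_n(\mathbf{X}) - r(\mathbf{X})]^2
&\leq \mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2 + \mathbb{E}\big[r(\mathbf{X})\, \mathbf{1}_{\mathcal{E}_n^c(\mathbf{X}, \Theta)}\big]^2 \\
&\leq \mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2 + \Bigg[\sup_{\mathbf{x} \in [0, 1]^d} r^2(\mathbf{x})\Bigg] \mathbb{P}(\mathcal{E}_n^c(\mathbf{X}, \Theta)).
\end{aligned} \tag{5.2}$$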



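Let us examine the first term on the right-hand side of (5.2). Observe that,
by the Cauchy-Schwarz inequality,

$$\begin{aligned}
&\mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2 \\
&\quad \leq \mathbb{E}\Bigg[\sum_{i=1}^{n} \sqrt{W_{ni}(\mathbf{X}, \Theta)}\, \sqrt{W_{ni}(\mathbf{X}, \Theta)}\; |r(\mathbf{X}_i) - r(\mathbf{X})|\Bigg]^2 \\
&\quad \leq \mathbb{E}\Bigg[\Bigg(\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\Bigg) \Bigg(\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))^2\Bigg)\Bigg] \\
&\quad \leq \mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))^2\Bigg]
\end{aligned}$$

(since the weights are subprobability weights).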

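Thus, denoting by $\|\mathbf{X}\|_{\mathcal{S}}$ the norm of $\mathbf{X}$ evaluated over the components in $\mathcal{S}$,
we obtain

$$\begin{aligned}
\mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2
&\leq \mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))^2\Bigg] \\
&\leq L^2 \sum_{i=1}^{n} \mathbb{E}\big[W_{ni}(\mathbf{X}, \Theta)\, \|\mathbf{X}_i - \mathbf{X}\|_{\mathcal{S}}^2\big] \\
&= n L^2\, \mathbb{E}\big[W_{n1}(\mathbf{X}, \Theta)\, \|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\big]
\end{aligned}$$

(by symmetry). But

$$\begin{aligned}
\mathbb{E}\big[W_{n1}(\mathbf{X}, \Theta)\, \|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\big]
&= \mathbb{E}\Bigg[\|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\, \frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}}{N_n(\mathbf{X}, \Theta)}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X}, \Theta)}\Bigg] \\
&= \mathbb{E}\Bigg[\|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\, \frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}}\Bigg] \\
&= \mathbb{E}\Bigg[\mathbb{E}\Bigg[\|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\, \frac{\mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}} \,\Bigg|\, \mathbf{X}, \mathbf{X}_1, \Theta\Bigg]\Bigg].
\end{aligned}$$

Thus,

$$\begin{aligned}
&\mathbb{E}\big[W_{n1}(\mathbf{X}, \Theta)\, \|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\big] \\
&\quad = \mathbb{E}\Bigg[\|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\, \mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}\, \mathbb{E}\Bigg[\frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}} \,\Bigg|\, \mathbf{X}, \mathbf{X}_1, \Theta\Bigg]\Bigg] \\
&\quad = \mathbb{E}\Bigg[\|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\, \mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}\, \mathbb{E}\Bigg[\frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}} \,\Bigg|\, \mathbf{X}, \Theta\Bigg]\Bigg]
\end{aligned}$$

(by the independence of the random variables $\mathbf{X}, \mathbf{X}_1, \dots, \mathbf{X}_n, \Theta$).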

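By Fact 5.2 and technical Lemma 5.1,

$$\mathbb{E}\Bigg[\frac{1}{1 + \sum_{i=2}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]}} \,\Bigg|\, \mathbf{X}, \Theta\Bigg] \leq \frac{2^{\lceil \log_2 k_n \rceil}}{n} \leq \frac{2 k_n}{n}.$$

Consequently,

$$\mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2 \leq 2 L^2 k_n\, \mathbb{E}\big[\|\mathbf{X}_1 - \mathbf{X}\|_{\mathcal{S}}^2\, \mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}\big].$$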



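Letting

$$A_n(\mathbf{X}, \Theta) = \prod_{j=1}^{d} A_{nj}(\mathbf{X}, \Theta),$$

we obtain

$$\begin{aligned}
&\mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2 \\
&\quad \leq 2 L^2 k_n \sum_{j \in \mathcal{S}} \mathbb{E}\Big[|X_1^{(j)} - X^{(j)}|^2\, \mathbf{1}_{[\mathbf{X}_1 \in A_n(\mathbf{X}, \Theta)]}\Big] \\
&\quad = 2 L^2 k_n \sum_{j \in \mathcal{S}} \mathbb{E}\Big[\rho_j(\mathbf{X}, \mathbf{X}_1, \Theta)\, \mathbb{E}_{X_1^{(j)}}\Big[|X_1^{(j)} - X^{(j)}|^2\, \mathbf{1}_{[X_1^{(j)} \in A_{nj}(\mathbf{X}, \Theta)]}\Big]\Big],
\end{aligned}$$

where, in the last equality, we set

$$\rho_j(\mathbf{X}, \mathbf{X}_1, \Theta) = \prod_{t = 1, \dots, d,\; t \neq j} \mathbf{1}_{[X_1^{(t)} \in A_{nt}(\mathbf{X}, \Theta)]}.$$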

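Therefore, using the fact that $\mathbf{X}_1$ is uniformly distributed over $[0, 1]^d$,

$$\mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2 \leq 2 L^2 k_n \sum_{j \in \mathcal{S}} \mathbb{E}\big[\rho_j(\mathbf{X}, \mathbf{X}_1, \Theta)\, \lambda^3(A_{nj}(\mathbf{X}, \Theta))\big].$$

Observing that

$$\begin{aligned}
\lambda(A_{nj}(\mathbf{X}, \Theta)) \times \mathbb{E}_{[X_1^{(t)} :\, t = 1, \dots, d,\; t \neq j]}\big[\rho_j(\mathbf{X}, \mathbf{X}_1, \Theta)\big] &= \lambda(A_n(\mathbf{X}, \Theta)) \\
&= 2^{-\lceil \log_2 k_n \rceil} \qquad \text{(Fact 5.2)},
\end{aligned}$$

we are led to

$$\begin{aligned}
\mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2
&\leq 2 L^2 \sum_{j \in \mathcal{S}} \mathbb{E}\big[\lambda^2(A_{nj}(\mathbf{X}, \Theta))\big] \\
&= 2 L^2 \sum_{j \in \mathcal{S}} \mathbb{E}\big[2^{-2 K_{nj}(\mathbf{X}, \Theta)}\big] \\
&= 2 L^2 \sum_{j \in \mathcal{S}} \mathbb{E}\big[\mathbb{E}\big[2^{-2 K_{nj}(\mathbf{X}, \Theta)} \mid \mathbf{X}\big]\big],
\end{aligned}$$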

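where, conditionally on $\mathbf{X}$, $K_{nj}(\mathbf{X}, \Theta)$ has a binomial $\mathcal{B}(\lceil \log_2 k_n \rceil, p_{nj})$
distribution (Fact 5.1). Consequently,

$$\begin{aligned}
\mathbb{E}\Bigg[\sum_{i=1}^{n} W_{ni}(\mathbf{X}, \Theta)\,(r(\mathbf{X}_i) - r(\mathbf{X}))\Bigg]^2
&\leq 2 L^2 \sum_{j \in \mathcal{S}} (1 - 0.75\, p_{nj})^{\lceil \log_2 k_n \rceil} \\
&\leq 2 L^2 \sum_{j \in \mathcal{S}} \exp\left(-\frac{0.75}{\log 2}\, p_{nj} \log k_n\right) \\
&= 2 L^2 \sum_{j \in \mathcal{S}} \frac{1}{k_n^{\frac{0.75}{S \log 2} (1 + \xi_{nj})}} \\
&\leq \frac{2 S L^2}{k_n^{\frac{0.75}{S \log 2} (1 + \gamma_n)}},
\end{aligned}$$

with $\gamma_n = \min_{j \in \mathcal{S}} \xi_{nj}$.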

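To finish the proof, it remains to bound the second term on the right-hand
side of (5.2), which is easier. Just note that

$$\begin{aligned}
\mathbb{P}(\mathcal{E}_n^c(\mathbf{X}, \Theta)) &= \mathbb{P}\Bigg(\sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]} = 0\Bigg) \\
&= \mathbb{E}\Bigg[\mathbb{P}\Bigg(\sum_{i=1}^{n} \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X}, \Theta)]} = 0 \,\Bigg|\, \mathbf{X}, \Theta\Bigg)\Bigg] \\
&= \big(1 - 2^{-\lceil \log_2 k_n \rceil}\big)^n \qquad \text{(by Fact 5.2)} \\
&\leq e^{-n/2k_n}.
\end{aligned}$$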



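Putting all the pieces together, we finally conclude that

$$\mathbb{E}[\tilde r_n(\mathbf{X}) - r(\mathbf{X})]^2 \leq \frac{2 S L^2}{k_n^{\frac{0.75}{S \log 2} (1 + \gamma_n)}} + \Bigg[\sup_{\mathbf{x} \in [0, 1]^d} r^2(\mathbf{x})\Bigg] e^{-n/2k_n},$$

as desired.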
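The two ingredients of this bound are easy to evaluate numerically. The sketch below is an illustration with hypothetical function names: the first function is the exact probability that the cell of $\mathbf{X}$ receives no data point when $\mathbf{X}$ is uniform (Fact 5.2), and the second is the right-hand side of the final inequality for given values of $S$, $L$, $k_n$, $n$, $\gamma_n$ and $\sup r^2$.

```python
import math


def empty_cell_probability(n, k_n):
    """(1 - 2^{-ceil(log2 k_n)})^n: chance that no X_i falls in the cell of X."""
    return (1.0 - 2.0 ** (-math.ceil(math.log2(k_n)))) ** n


def bias_bound(S, L, k_n, n, gamma_n, sup_r2):
    """Right-hand side of the bias bound obtained at the end of the proof."""
    exponent = 0.75 / (S * math.log(2.0)) * (1.0 + gamma_n)
    return 2.0 * S * L ** 2 / k_n ** exponent + sup_r2 * math.exp(-n / (2.0 * k_n))
```

Since $2^{\lceil \log_2 k_n \rceil} \leq 2 k_n$, the first quantity never exceeds $e^{-n/2k_n}$. The first term of the bound decreases as $k_n$ grows while the second increases, which is the trade-off governing the choice of $k_n$.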



5.4 Some technical results 

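The following result is an extension of Lemma 4.1 in Györfi et al. [26]. Its
proof is given here for the sake of completeness.

Lemma 5.1 Let $Z$ be a binomial $\mathcal{B}(N, p)$ random variable, with $p \in (0, 1]$.
Then

(i) $$\mathbb{E}\left[\frac{1}{1 + Z}\right] \leq \frac{1}{(N + 1) p}.$$

(ii) $$\mathbb{E}\left[\frac{1}{Z}\, \mathbf{1}_{[Z \geq 1]}\right] \leq \frac{2}{(N + 1) p}.$$

(iii) $$\mathbb{E}\left[\frac{1}{1 + Z^2}\right] \leq \frac{3}{(N + 1)(N + 2) p^2}.$$

Proof of Lemma 5.1 To prove statement (i), we write

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{1 + Z}\right] &= \sum_{j=0}^{N} \frac{1}{1 + j} \binom{N}{j} p^j (1 - p)^{N - j} \\
&= \frac{1}{(N + 1) p} \sum_{j=0}^{N} \binom{N + 1}{j + 1} p^{j + 1} (1 - p)^{N - j} \\
&\leq \frac{1}{(N + 1) p} \sum_{j=0}^{N + 1} \binom{N + 1}{j} p^{j} (1 - p)^{N + 1 - j} \\
&= \frac{1}{(N + 1) p}.
\end{aligned}$$

The second statement follows from the inequality

$$\mathbb{E}\left[\frac{1}{Z}\, \mathbf{1}_{[Z \geq 1]}\right] \leq \mathbb{E}\left[\frac{2}{1 + Z}\right]$$

and the third one by observing that

$$\mathbb{E}\left[\frac{1}{1 + Z^2}\right] = \sum_{j=0}^{N} \frac{1}{1 + j^2} \binom{N}{j} p^j (1 - p)^{N - j}.$$

Therefore

$$\begin{aligned}
\mathbb{E}\left[\frac{1}{1 + Z^2}\right] &= \frac{1}{(N + 1) p} \sum_{j=0}^{N} \frac{1 + j}{1 + j^2} \binom{N + 1}{j + 1} p^{j + 1} (1 - p)^{N - j} \\
&\leq \frac{3}{(N + 1) p} \sum_{j=0}^{N} \frac{1}{2 + j} \binom{N + 1}{j + 1} p^{j + 1} (1 - p)^{N - j} \\
&\leq \frac{3}{(N + 1) p} \sum_{j=0}^{N + 1} \frac{1}{1 + j} \binom{N + 1}{j} p^{j} (1 - p)^{N + 1 - j} \\
&\leq \frac{3}{(N + 1)(N + 2) p^2}
\end{aligned}$$

(by (i)).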
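The three inequalities of Lemma 5.1 can be checked numerically by summing the binomial series exactly. The sketch below is only a sanity check for a few parameter values, not a proof; the function names are illustrative.

```python
import math


def pmf(N, p, j):
    """P(Z = j) for Z ~ Binomial(N, p)."""
    return math.comb(N, j) * p ** j * (1.0 - p) ** (N - j)


def lemma_5_1_sides(N, p):
    """Return [(lhs, rhs)] for statements (i), (ii) and (iii), using exact sums."""
    e1 = sum(pmf(N, p, j) / (1 + j) for j in range(N + 1))
    e2 = sum(pmf(N, p, j) / j for j in range(1, N + 1))
    e3 = sum(pmf(N, p, j) / (1 + j * j) for j in range(N + 1))
    return [(e1, 1.0 / ((N + 1) * p)),
            (e2, 2.0 / ((N + 1) * p)),
            (e3, 3.0 / ((N + 1) * (N + 2) * p * p))]
```

For $N = 1$ and $p = 1$ the variable $Z$ equals 1 almost surely and all three statements hold with equality, so the constants 1, 2 and 3 cannot be lowered in general.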



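Lemma 5.2 Let $Z_1$ and $Z_2$ be two independent binomial $\mathcal{B}(N, p)$ random
variables. Set, for all $z \in \mathbb{C}^\star$, $\varphi(z) = \mathbb{E}[z^{Z_1 - Z_2}]$. Then

(i) For all $z \in \mathbb{C}^\star$,

$$\varphi(z) = \left[p (1 - p) (z + z^{-1}) + 1 - 2 p (1 - p)\right]^N.$$

(ii) For all $j \in \mathbb{N}$,

$$\mathbb{P}(Z_1 - Z_2 = j) = \frac{1}{2 \pi i} \int_\Gamma \frac{\varphi(z)}{z^{j + 1}}\, \mathrm{d}z,$$

where $\Gamma$ is the positively oriented unit circle.

(iii) For all $d \geq 1$,

$$\mathbb{E}\big[2^{-d (Z_1 - Z_2)_+}\big] \leq \frac{24}{\pi} \int_0^1 \exp\left(-4 N p (1 - p) t^2\right) \mathrm{d}t.$$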
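Statements (i) and (ii) lend themselves to a direct numerical check. The sketch below, with illustrative function names, computes the law of $Z_1 - Z_2$ by convolution, compares $\mathbb{E}[z^{Z_1 - Z_2}]$ with the closed form of (i), and approximates the contour integral of (ii) by a Riemann sum over $z = e^{i\theta}$. Because the integrand is a trigonometric polynomial of degree at most $2N$, a uniform grid with more than $2N$ points is assumed to integrate it exactly up to rounding.

```python
import cmath
import math


def pmf(N, p, j):
    """P(Z = j) for Z ~ Binomial(N, p)."""
    return math.comb(N, j) * p ** j * (1.0 - p) ** (N - j)


def diff_pmf(N, p, j):
    """P(Z1 - Z2 = j) by direct convolution of two Binomial(N, p) laws."""
    return sum(pmf(N, p, a) * pmf(N, p, a - j)
               for a in range(max(j, 0), min(N, N + j) + 1))


def phi_direct(N, p, z):
    """E[z^(Z1 - Z2)] computed from the law of the difference."""
    return sum(diff_pmf(N, p, j) * z ** j for j in range(-N, N + 1))


def phi_closed(N, p, z):
    """Closed form of statement (i)."""
    return (p * (1.0 - p) * (z + 1.0 / z) + 1.0 - 2.0 * p * (1.0 - p)) ** N


def contour_probability(N, p, j, m=256):
    """Riemann sum for (1 / (2 pi i)) * integral of phi(z) / z^(j+1) over |z| = 1."""
    total = 0.0
    for s in range(m):
        theta = 2.0 * math.pi * s / m
        z = cmath.exp(1j * theta)
        # With z = e^{i theta}, dz / (i z) = d theta, so the integrand is phi(z) * z^{-j}.
        total += (phi_closed(N, p, z) * cmath.exp(-1j * j * theta)).real
    return total / m
```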



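Proof of Lemma 5.2 Statement (i) is clear and (ii) is an immediate
consequence of Cauchy's integral formula (Rudin [34]). To prove statement
(iii), write

$$\begin{aligned}
\mathbb{E}\big[2^{-d (Z_1 - Z_2)_+}\big]
&= \sum_{j=0}^{N} 2^{-dj}\, \mathbb{P}((Z_1 - Z_2)_+ = j) \\
&\leq \sum_{j=0}^{\infty} 2^{-dj}\, \mathbb{P}(Z_1 - Z_2 = j) \\
&= \frac{1}{2 \pi i} \int_\Gamma \frac{\varphi(z)}{z} \sum_{j=0}^{\infty} \left(\frac{2^{-d}}{z}\right)^{j} \mathrm{d}z \qquad \text{(by statement (ii))} \\
&= \frac{1}{2 \pi} \int_{-\pi}^{\pi} \frac{\varphi(e^{i\theta})}{1 - 2^{-d} e^{-i\theta}}\, \mathrm{d}\theta \qquad \text{(by setting } z = e^{i\theta},\ \theta \in [-\pi, \pi]\text{)} \\
&= \frac{2^{d - 1}}{\pi} \int_{-\pi}^{\pi} \left[1 + 2 p (1 - p)(\cos\theta - 1)\right]^N \frac{e^{i\theta}}{2^d e^{i\theta} - 1}\, \mathrm{d}\theta \qquad \text{(by statement (i))}.
\end{aligned}$$

Noting that

$$\frac{e^{i\theta}}{2^d e^{i\theta} - 1} = \frac{2^d - e^{i\theta}}{2^{2d} - 2^{d + 1} \cos\theta + 1},$$

we obtain

$$\mathbb{E}\big[2^{-d (Z_1 - Z_2)_+}\big] \leq \frac{2^{d - 1}}{\pi} \int_{-\pi}^{\pi} \left[1 + 2 p (1 - p)(\cos\theta - 1)\right]^N \frac{2^d - \cos\theta}{2^{2d} - 2^{d + 1} \cos\theta + 1}\, \mathrm{d}\theta.$$

The bound

$$\frac{2^d - \cos\theta}{2^{2d} - 2^{d + 1} \cos\theta + 1} \leq \frac{2^d + 1}{(2^d - 1)^2}$$

leads to

$$\begin{aligned}
\mathbb{E}\big[2^{-d (Z_1 - Z_2)_+}\big]
&\leq \frac{2^{d - 1} (2^d + 1)}{\pi (2^d - 1)^2} \int_{-\pi}^{\pi} \left[1 + 2 p (1 - p)(\cos\theta - 1)\right]^N \mathrm{d}\theta \\
&= \frac{2^{d} (2^d + 1)}{\pi (2^d - 1)^2} \int_{0}^{\pi} \left[1 + 2 p (1 - p)(\cos\theta - 1)\right]^N \mathrm{d}\theta \\
&= \frac{2^{d} (2^d + 1)}{\pi (2^d - 1)^2} \int_{0}^{\pi} \left[1 - 4 p (1 - p) \sin^2(\theta/2)\right]^N \mathrm{d}\theta \qquad (\cos\theta - 1 = -2 \sin^2(\theta/2)) \\
&= \frac{2^{d + 1} (2^d + 1)}{\pi (2^d - 1)^2} \int_{0}^{\pi/2} \left[1 - 4 p (1 - p) \sin^2\theta\right]^N \mathrm{d}\theta.
\end{aligned}$$

Using the elementary inequality $(1 - z)^N \leq e^{-Nz}$ for $z \in [0, 1]$ and the change
of variable

$$t = \tan(\theta/2),$$

we finally obtain

$$\begin{aligned}
\mathbb{E}\big[2^{-d (Z_1 - Z_2)_+}\big]
&\leq \frac{2^{d + 2} (2^d + 1)}{\pi (2^d - 1)^2} \int_0^1 \exp\left(-\frac{16 N p (1 - p) t^2}{(1 + t^2)^2}\right) \frac{1}{1 + t^2}\, \mathrm{d}t \\
&\leq C_d \int_0^1 \exp\left(-4 N p (1 - p) t^2\right) \mathrm{d}t,
\end{aligned}$$

with

$$C_d = \frac{2^{d + 2} (2^d + 1)}{\pi (2^d - 1)^2}.$$

The conclusion follows by observing that $C_d \leq 24/\pi$ for all $d \geq 1$.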



Evaluating the integral in statement (iii) of Lemma 5.2 leads to the following
proposition:

Proposition 5.1 Let $Z_1$ and $Z_2$ be two independent binomial $\mathcal{B}(N, p)$ random
variables, with $p \in (0, 1)$. Then, for all $d \geq 1$,

$$\mathbb{E}\big[2^{-d (Z_1 - Z_2)_+}\big] \leq \frac{24}{\pi} \min\left(1, \sqrt{\frac{\pi}{16 N p (1 - p)}}\right).$$